specifying interactions of ILK with downstream, cytoplasmic or
cytoskeletal proteins. Reduced ECM adhesion by the p59ILK
overexpressing cells is consistent with our observation of
adhesion-dependent inhibition of ILK activity, and suggests that
p59ILK plays a role in inside-out integrin signalling. Furthermore
the p59ILK-induced anchorage-independent growth of epithelial
cells indicates a role for ILK in mediating intracellular signal
transduction by integrins.
Experimental and simulation 7 studies show that small monomeric
proteins fold in one kinetic step, which entails overcoming
the free-energy barrier between the unfolded and the native
protein through a transition state 9. Two models of transition
state formation have been proposed: a 'nonspecific' one in which
it depends on the formation of a sufficient number of native-like
contacts regardless of what amino acids are involved, and a
'specific' one, in which it depends on formation of a specific subset
of the native structure (a folding nucleus) 8. The latter requires
that some amino acids form most of their contacts in the transition
state, whereas others only do so on reaching the native
conformation. If so, mutations affecting the stability of the
transition state nucleus should have a greater effect on the
folding kinetics than mutations elsewhere, and the residues
involved should be evolutionarily conserved. Lattice-model
simulations and experiments 8 suggest that such mutations exist.
Here we present a method for determining the folding nucleus of a 
protein with known structure with two-state folding kinetics. This 
method is based on the alignment of many sequences designed to 
fold into the native conformation of a protein to identify the 
positions where amino acids are most conserved in designed 
sequences. The method is applied to chymotrypsin inhibitor 2 
(CI2), a protein whose transition state has been previously 
studied by protein engineering 14-16. The involvement of residues
in the folding nucleus of CI2 is clearly correlated with their
conservation in design, and the residues forming the nucleus are
highly conserved in 23 natural sequences homologous to CI2.

We first studied a simple lattice model of the protein 17. (1) We
chose an arbitrary conformation of the lattice protein chain to serve as
the native structure and then (2) selected sequences that deliver
low energy in this 'native' structure compared to unfolded and
misfolded conformations. (3) The designed sequences (from 2)
were folded using Monte Carlo simulations. Because folding
simulations and sequence design are carried out using the same
set of potentials, a self-consistent study of the model can be
carried out with any choice of potential. For some (but not all)
predictions on real proteins it is possible that the particular choice
of potentials is not very important 17 (see below).
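The contact energy that underlies both design and folding in such a lattice model (equation (1) in the legend to Fig. 1) can be illustrated with a minimal sketch. It assumes a cubic lattice with integer coordinates; the function name `contact_energy`, the two-letter alphabet and the values in `TOY_B` are hypothetical and chosen only for illustration. They are not the MJ or KGS parameters.

```python
from itertools import combinations


def contact_energy(coords, sequence, B):
    """Sum B over residue pairs that are lattice neighbours but not bonded.

    coords   -- list of integer (x, y, z) lattice sites, one per monomer
    sequence -- string of residue types, same length as coords
    B        -- dict mapping a sorted pair of residue types to an energy
    """
    energy = 0.0
    for i, j in combinations(range(len(coords)), 2):
        if j - i == 1:
            continue  # consecutive monomers share a covalent bond
        manhattan = sum(abs(a - b) for a, b in zip(coords[i], coords[j]))
        if manhattan == 1:  # nearest neighbours on the cubic lattice
            energy += B[tuple(sorted((sequence[i], sequence[j])))]
    return energy


# Hypothetical toy parameters (assumption): only H-H contacts are attractive.
TOY_B = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.0}

# A 4-mer folded into a square: monomers 0 and 3 form the only contact.
SQUARE = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]

print(contact_energy(SQUARE, "HPPH", TOY_B))  # prints -1.0
```

Because the same function scores a sequence in any conformation, it can serve both the design step (fixed conformation, varying sequence) and the folding step (fixed sequence, varying conformation), which is the self-consistency noted above.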

Following this method we chose the target conformation
(Fig. 1a), and designed sequences to fold to it. Further, the analysis
based on Monte Carlo folding simulations 8 revealed the folding
nucleus (Fig. 1a) for both sequences shown in Fig. 1b,c.

The stochastic design algorithm generated 10^6 sequences, each
having low energy in the conformation shown in Fig. 1a. Alignment
of these sequences revealed a remarkable feature: a correlation
between residue conservatism and its participation in the
folding nucleus (Fig. 2). Indeed, all four most conserved residues
(5,16,20,35) belong to the nucleus (see Fig. 1a). This could have
been due to the degree to which they are buried in the native structure (see
Fig. 1), but this is not the case because the conservation of all buried
residues is smaller than that of nucleus residues (see Fig. 2b).

The key feature of the nucleation mechanism is the existence of 
'kinetically important' positions. To test this we designed a set of
sequences where positions (5,16,20,35) inside the nucleus are
constrained to 'alanines', which destabilize the nucleus (in an
MJ parameter set). We studied the folding kinetics for these 
sequences (Fig. 3). The existence of some 'alanine' residues in the 
nucleus caused a pronounced slowing down of folding. This is an 
exclusively kinetic effect: several 'wild-type' sequences (designed 
without forcing 'alanines' into the nucleus) have native-state 
energy comparable to that of sequences designed with alanines 
in the nucleus. This example implies that the present method of 
identifying a folding nucleus would give wrong predictions for
sequences with alanines designed in the nucleus, because these 
sequences do not appear to fold through a specific nucleus 
mechanism. Therefore the present method only applies to fast- 
folding sequences. 

We applied this method to predict the folding nucleus of CI2,
originally studied in refs 15, 16. To this end we used the real off-
lattice structure of CI2 as the target and designed sequences that
have low energy in this conformation. There is a clear qualitative
correlation between site conservation and φ-values for folding.
Our calculations (see Fig. 4) predict that the most conserved residues,
particularly A35, I39, L68, I70 and I76, are likely to belong to the
folding nucleus. After this work was completed, we learned from
A. Fersht (personal communication) that A35 (not studied in refs
FIG. 1 a, The conformation of the 48-mer chosen as the native state in our
design/folding procedure. Each amino-acid residue is represented as a bead
occupying a lattice site. Although the model does not treat side chains explicitly,
the amino acids are chemically different; their differences are manifested in
pairwise interaction energies of different magnitude and sign, depending on the
identities of interacting amino acids. A conformation is described by the set of
coordinates of all monomers $\{r_i\}$. The energy of a conformation is:

$$E = \sum_{i<j} B(\xi_i, \xi_j)\,\Delta(r_i - r_j) \qquad (1)$$

where $\Delta(r_i - r_j) = 1$ if monomers i and j are lattice neighbours not connected by
a covalent bond, and 0 otherwise. $B(\xi_i, \xi_j)$ is the magnitude of interaction
between amino acids of type $\xi_i$ and $\xi_j$. There are 20 types of amino acid in the
model. Two parameter sets, B, were used: one proposed by Miyazawa and
Jernigan 19 (MJ), and the other proposed by Kolinski, Godzik and Skolnick 20
(KGS). The parameters given in Table 6 of ref. 19 were used as the MJ set.
They represent the 'excess' pairwise interaction between two amino acids as
compared with their averaged interaction with their environment. The values of B
in equation (1) were shifted and normalized to achieve a zero average over all
possible contacts and a standard variance of unity. This amounts to multiplying all
parameters in refs 19, 20 by a constant factor and adding a constant to each
interaction parameter B. This procedure effectively sets the temperature scale
for simulations (more details in ref. 8). For each set of parameters we designed
sequences to fold to the conformation shown here. The design procedure has
been described in detail elsewhere 17. It is a stochastic (Monte Carlo)
optimization routine in sequence space which keeps amino-acid composition
unchanged; it minimizes the energy of the native conformation. The condition of
constant amino-acid composition makes it equivalent to optimizing the relative
energy of the native state, or Z-score. As is characteristic of Monte Carlo
searches, unfavourable mutations can also be accepted, with a small probability
given by a Metropolis criterion with selective temperature $T_{sel}$. We chose
$T_{sel}$ ~ 0.15 (in our temperature scale), which is sufficiently low to generate
stable and fast-folding sequences. b, c, Two sequences designed to fold and be
stable in the conformation shown in a: with KGS parameters (b), and with MJ
parameters (c). The design tends to place the most strongly interacting amino
acids in the interior where they can form most contacts. The strongest 'excess'
interactions in MJ parameters from ref. 19 are between 'charged' groups (D and
K); therefore they are buried in sequence (c). Alternatively, the strongest
interactions in the KGS set are between hydrophobic groups, hence designs
with these parameters yield more realistic sequences with hydrophobic groups
buried, as in sequence b. Monte Carlo folding simulations were carried out for
both sequences. Both of them folded to and were stable in the conformation
shown in a with their respective forcefields. The nucleus (determined from the
folding simulations, using the method described in detail in ref. 8) was identical
for both sequences; it is shown by broken lines in a.
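The design step described in this legend (a Monte Carlo search in sequence space at fixed amino-acid composition, with a Metropolis acceptance rule at a selective temperature) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: composition is preserved by swapping two positions, the names `native_contacts`, `energy`, `design` and `toy_b` are hypothetical, and the two-letter potential stands in for the MJ or KGS parameters. The 8-mer cube is used only to keep the example small.

```python
import math
import random


def native_contacts(coords):
    """Pairs (i, j) that are lattice neighbours but not covalently bonded."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):
            if sum(abs(a - b) for a, b in zip(coords[i], coords[j])) == 1:
                pairs.append((i, j))
    return pairs


def energy(seq, contacts, b):
    """Energy of a sequence in the target conformation (contact sum)."""
    return sum(b(seq[i], seq[j]) for i, j in contacts)


def design(seq, contacts, b, t_sel=0.15, steps=2000, seed=0):
    """Metropolis search over residue swaps; returns the best sequence seen."""
    rng = random.Random(seed)
    seq = list(seq)
    e = energy(seq, contacts, b)
    best, best_e = seq[:], e
    for _ in range(steps):
        i, j = rng.sample(range(len(seq)), 2)
        seq[i], seq[j] = seq[j], seq[i]  # a swap keeps composition fixed
        e_new = energy(seq, contacts, b)
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t_sel):
            e = e_new  # accept (unfavourable moves only rarely)
            if e < best_e:
                best, best_e = seq[:], e
        else:
            seq[i], seq[j] = seq[j], seq[i]  # reject: undo the swap
    return "".join(best), best_e


def toy_b(a, b):
    """Hypothetical toy potential (assumption): only H-H contacts attract."""
    return -1.0 if a == "H" and b == "H" else 0.0


# An 8-mer wound around the corners of a unit cube as a small target.
CUBE = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
        (0, 1, 1), (1, 1, 1), (1, 0, 1), (0, 0, 1)]

contacts = native_contacts(CUBE)
designed, designed_e = design("HPHPHPHP", contacts, toy_b)
print(designed, designed_e)
```

Repeating such runs and collecting the accepted low-energy sequences gives the population from which per-site amino-acid frequencies are estimated.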



15, 16) is the residue most involved in the nucleus with φ = 1.0
(see also ref. 14); I76 looks like an exception, with a low φ-value.
However, it is not: despite the low φ-value, recent measurements
on a number of residues contacting I76 strongly suggest that I76 is
also a key residue of the folding nucleus 14.

We also found a striking relationship between the residue 
conservatism in our design and evolutionary conservatism in the 
alignment of natural sequences homologous to CI2 (Fig. 4b) 18.
The key nucleus residue A35 is 100% conserved. Among the most
conserved in Fig. 4b are nucleus residues I76 and I39. Nucleus residue L68 is
moderately conserved in the alignment, though, because of more
frequent substitutions of L to V.

This finding also suggests that our design procedure might 
reflect on some features of the evolutionary process of protein 
morphogenesis, namely those aspects related to folding. It is
possible that some of the surface residues are evolutionarily
conserved for functional reasons not taken into account in the
design procedure. To this end, the comparison of the results of the
design (Fig. 4a) with the alignment (Fig. 4b) might be helpful in
identifying which residues are conserved for 'folding' reasons and
which ones are functionally conserved. 

The possible physical rationale for the suggested method is that 
the sequence design identifies a contiguous cluster of core resi- 
dues which are close enough to each other in space to form a 
contiguous network of interactions. The energy-based design 
FIG. 2 a, The sequence design entropy is defined as:

$$S(i) = -\sum_{j=1}^{m} p_j(i) \ln p_j(i) \qquad (2)$$

where $p_j(i)$ is the frequency of occurrence of amino acid of type j at site i;
m = 20 is the total number of amino-acid types. To estimate the frequencies
$p_j(i)$, we performed long runs of the Monte Carlo sequence design
algorithm using a low selective temperature ($T_{sel}$ ~ 0.15) to obtain ~10^6
sequences per parameter set. We estimated the desired frequencies as the
fraction of this population of designed sequences bearing an amino acid of
type j at position i. Shown is a plot of S(i) as a function of position i for the
KGS parameter set; the corresponding plot for the MJ parameter set is
similar (not shown). b, Histogram for the distribution of design entropy for
buried residues (having three or four non-covalent contacts in the native
structure, Fig. 1a) from the plot in a. White bars, all buried residues; grey
bars, those of the buried residues that have two or more nucleus contacts.
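The per-site entropy of equation (2) can be computed directly from a set of aligned, equal-length sequences. The sketch below is illustrative only; the function name `design_entropy` and the four toy sequences are assumptions, and amino-acid types that never occur at a site contribute nothing to the sum.

```python
import math
from collections import Counter


def design_entropy(sequences):
    """Return S(i) = -sum_j p_j(i) ln p_j(i) for every position i."""
    length = len(sequences[0])
    entropies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        total = sum(counts.values())
        # Frequencies are the fraction of sequences with each type at site i.
        s = -sum((c / total) * math.log(c / total) for c in counts.values())
        entropies.append(s)
    return entropies


# Toy population (assumption): site 0 is fully conserved, site 1 is split.
toy_population = ["AK", "AL", "AK", "AL"]
entropies = design_entropy(toy_population)
print(entropies)
```

A fully conserved site gives S(i) = 0, and a site split evenly between two types gives ln 2, so low values of S(i) mark the positions that the design conserves most strongly.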
FIG. 3 The mean first-passage folding time for different sequences
designed (with MJ parameters) to fold into the native structure shown in
Fig. 1a. Monte Carlo folding simulations were done at a low temperature
(T ~ 0.8) so that the folding time of the 'wild-type' sequences (black circles)
would not depend dramatically on native-state energy 7. This is indeed the
case in the present simulations: 11 'wild-type' sequences having different
energies in the native state (black circles) have similar average folding
times. Low temperature for folding simulations was chosen to distinguish
the special role of nucleus residues in folding from the more obvious 26-29
relation between folding rate and the stability of the native state. In the
parameters from ref. 19, alanine interacts unfavourably with most other
amino acids so that its placement in the nucleus destabilizes the latter.
Several sequences were designed to have an alanine residue at the
predetermined nucleus positions. (Alanine residues were placed at these
positions and mutations were forbidden there; otherwise the design
algorithm proceeded as usual.) Black squares correspond to sequences
with one alanine residue, placed in one of positions 5, 16, 20, 35; grey circles
correspond to sequences with 2 fixed alanine residues, in positions (5,16),
(5,20), (16,20), (5,35), (16,35) or (20,35); grey squares correspond to
sequences having 3 fixed alanines placed in all triplets out of positions
(5,16,20,35), and the large grey circle corresponds to the sequence
designed with fixed alanine residue at all four positions (5,16,20,35).

procedure conservatively places into these positions residues that
strongly attract each other. This creates a relatively low free-
energy set of partly folded conformations in which mutually
stabilizing strong nucleus contacts are formed while other parts
of the chain are disordered. In the model representing folding as
kinetically two-state, such a set of conformations serves as a
saddle-point in the free-energy landscape, which is the transition
state. This also suggests that our current method may be applicable
only to proteins with two-state folding kinetics.
Note added in proof: The notation for CI2 residues in ref. 14 is
shifted by 19 units from the notation used in ref. 15 and this work;
for example, A35 here would be A16 in ref. 14.
FIG. 4 a, The design entropy of CI2. The PDB structure of CI2 (2ci2) was taken
as the target conformation. Then, using an MC sequence design algorithm,
we generated many sequences (with the same amino-acid composition as
the native CI2) exhibiting a low energy in this target conformation. The
energy function for design was calculated for contacts using equation (1).
Two residues were considered to be in contact if the distance between their
Cβ atoms (Cα for G) was <7.5 Å and if they were more than two units apart
from each other along the sequence. KGS and MJ parameters were used for
$B(\xi_i, \xi_j)$ (see equation (1)). Amino-acid frequencies were taken from the
designed sequences (as explained in Fig. 2) and substituted into equation
(2). We compared these results with the numbers of native-like contacts of
a given residue in the transition state relative to that in the native state
(φ-values 30). Labels represent the φ-values for all residues studied in refs 15,
16; the shaded label is the recent result of ref. 14 which only came to our
attention after this study was completed. The plot shown here was
calculated using the KGS parameters. The MJ parameters yielded very
similar results indicating the same positions as conservative ones. Design
with KGS parameters placed predominantly L and I in the conserved
positions, which coincided in most cases (except position 35) with what
is seen in real sequences. MJ parameters placed charged groups in these
positions, for the reasons explained in the legend to Fig. 1. b, The alignment
entropy (amino-acid variability) calculated over 23 sequences homologous
to CI2 (ref. 18). Frequencies of amino-acid occurrences were taken from the
alignment, and sequence alignment entropy was evaluated according to
equation (2). (These data were already calculated in the file we used,
2ci2.hssp (ref. 18).)
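The off-lattice contact definition given in this legend (representative atoms closer than 7.5 Å and more than two positions apart along the chain) can be sketched as below. The residue data structure, the function names `representative_atom` and `contact_pairs`, and the toy coordinates are assumptions for illustration; parsing of the actual 2ci2 coordinate file is not shown.

```python
import math


def representative_atom(residue):
    """Use the C-beta position, or C-alpha for glycine (which has no C-beta)."""
    atoms = residue["atoms"]
    return atoms["CA"] if residue["name"] == "GLY" else atoms["CB"]


def contact_pairs(residues, cutoff=7.5, min_separation=3):
    """Pairs (i, j) with j - i > 2 whose representative atoms lie within cutoff."""
    points = [representative_atom(r) for r in residues]
    pairs = []
    for i in range(len(points)):
        for j in range(i + min_separation, len(points)):
            if math.dist(points[i], points[j]) < cutoff:
                pairs.append((i, j))
    return pairs


# Toy hairpin (assumption): two antiparallel strands 5 angstroms apart.
toy_points = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
              (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0)]
toy_residues = [{"name": "ALA", "atoms": {"CA": p, "CB": p}} for p in toy_points]
toy_residues[5] = {"name": "GLY", "atoms": {"CA": toy_points[5]}}

print(contact_pairs(toy_residues))  # prints [(0, 4), (0, 5), (1, 4), (1, 5)]
```

The resulting pair list plays the same role for the real protein as the neighbour list does on the lattice: it fixes which terms of equation (1) are counted when a designed sequence is scored in the target conformation.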